Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Yingxin Zhang, UID: 006140202

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.7    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.1       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(ggplot2)

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:    9.167 GiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer: Here I copy the Parquet format generated in Homework 2 to the current working directory hw3 and renamed it as labevents_pq. Then I import the necessary data files and generate the visualization for the patient with subject_id 10013310.

# import data
patients <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transfers <- read_csv("~/mimic/hosp/transfers.csv.gz")
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
labevents <- arrow::open_dataset("labevents_pq", format = 'parquet') |> 
  as_tibble()
procedures_icd <- read_csv("~/mimic/hosp/procedures_icd.csv.gz")
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diagnoses_icd <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz")
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_procedures <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_diagnoses <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# filter data for patient 10013310
patient_id <- 10013310
patients_10013310 <- patients %>% filter(subject_id == patient_id)
admissions_10013310 <- admissions %>% filter(subject_id == patient_id)
transfers_10013310 <- transfers %>% filter(subject_id == patient_id)
labevents_10013310 <- labevents %>% filter(subject_id == patient_id)
procedures_icd_10013310 <- procedures_icd %>% filter(subject_id == patient_id)
diagnoses_icd_10013310 <- diagnoses_icd %>% filter(subject_id == patient_id)
# display the filtered data
patients_10013310
# A tibble: 1 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10013310 F              70        2153 2017 - 2019       2153-11-19
admissions_10013310
# A tibble: 3 × 16
  subject_id hadm_id admittime           dischtime           deathtime
       <dbl>   <dbl> <dttm>              <dttm>              <dttm>   
1   10013310  2.12e7 2153-05-26 14:18:00 2153-06-05 19:30:00 NA       
2   10013310  2.21e7 2153-06-10 11:55:00 2153-07-21 18:00:00 NA       
3   10013310  2.77e7 2153-05-06 18:03:00 2153-05-13 13:45:00 NA       
# ℹ 11 more variables: admission_type <chr>, admit_provider_id <chr>,
#   admission_location <chr>, discharge_location <chr>, insurance <chr>,
#   language <chr>, marital_status <chr>, race <chr>, edregtime <dttm>,
#   edouttime <dttm>, hospital_expire_flag <dbl>
transfers_10013310
# A tibble: 14 × 7
   subject_id  hadm_id transfer_id eventtype careunit        intime             
        <dbl>    <dbl>       <dbl> <chr>     <chr>           <dttm>             
 1   10013310 21243435    31696219 discharge <NA>            2153-06-05 19:58:00
 2   10013310 21243435    31736720 ED        Emergency Depa… 2153-05-26 08:56:00
 3   10013310 21243435    33511674 transfer  Medicine/Cardi… 2153-05-26 16:19:26
 4   10013310 21243435    34848129 transfer  Medicine/Cardi… 2153-05-26 14:42:55
 5   10013310 21243435    38910974 admit     Medicine/Cardi… 2153-05-26 14:18:39
 6   10013310 22098926    31651850 transfer  Neuro Intermed… 2153-06-12 16:31:33
 7   10013310 22098926    32769810 admit     Neuro Surgical… 2153-06-10 11:55:42
 8   10013310 22098926    33278851 transfer  Medicine        2153-06-16 19:03:14
 9   10013310 22098926    34063502 ED        Emergency Depa… 2153-06-10 10:40:00
10   10013310 22098926    36029206 discharge <NA>            2153-07-21 18:02:28
11   10013310 27682188    30077870 transfer  Medicine/Cardi… 2153-05-07 20:47:19
12   10013310 27682188    30444898 discharge <NA>            2153-05-13 15:36:52
13   10013310 27682188    31203589 admit     Coronary Care … 2153-05-06 18:28:00
14   10013310 27682188    35160955 ED        Emergency Depa… 2153-05-06 10:21:00
# ℹ 1 more variable: outtime <dttm>
# add col unit to transfers, if careunit contains "ICU" or "CCU", then unit is "ICU/CCU", otherwise "non-ICU/CCU"
transfers_10013310 <- transfers_10013310 %>%
  mutate(unit = ifelse(str_detect(careunit, "ICU|CCU"), 5, 2.5))

# remove missing values in transfers_10013310
transfers_10013310 <- transfers_10013310 %>% drop_na()
# remove missing values in labevents_10013310
labevents_10013310 <- labevents_10013310 %>% drop_na()
# merge procedures_icd and d_icd_procedures
procedures_10013310 <- procedures_icd_10013310 %>%
  left_join(d_icd_procedures, by = "icd_code")

# change the colname "long_title" to "long_procedures"
colnames(procedures_10013310)[colnames(procedures_10013310) == "long_title"] <- "Procedure"

procedures_10013310$chartdate <- as.POSIXct(procedures_10013310$chartdate)
procedures_10013310
# A tibble: 9 × 8
  subject_id  hadm_id seq_num chartdate           icd_code icd_version.x
       <dbl>    <dbl>   <dbl> <dttm>              <chr>            <dbl>
1   10013310 21243435       1 2153-05-27 00:00:00 4A023N7             10
2   10013310 21243435       2 2153-05-27 00:00:00 B2111ZZ             10
3   10013310 21243435       3 2153-05-27 00:00:00 B241ZZ3             10
4   10013310 22098926       1 2153-06-10 00:00:00 03CG3ZZ             10
5   10013310 22098926       2 2153-06-10 00:00:00 3E05317             10
6   10013310 22098926       3 2153-07-15 00:00:00 0DH63UZ             10
7   10013310 22098926       4 2153-06-11 00:00:00 3E0G76Z             10
8   10013310 27682188       1 2153-05-06 00:00:00 027034Z             10
9   10013310 27682188       2 2153-05-06 00:00:00 B211YZZ             10
# ℹ 2 more variables: icd_version.y <dbl>, Procedure <chr>
# merge diagnoses_icd, transfers and d_icd_diagnoses
diagnoses_10013310 <- diagnoses_icd_10013310 %>%
  left_join(d_icd_diagnoses, by = "icd_code")

# change the colname "long_title" to "long_title_diagnoses"
colnames(diagnoses_10013310)[colnames(diagnoses_10013310) == "long_title"] <- "long_diagnoses"
# pull top 3 diagnoses
top3_diagnoses <- diagnoses_10013310 %>%
  count(long_diagnoses, sort = TRUE) %>%
  slice(1:3) %>%
  pull(long_diagnoses)

top3_diagnoses
[1] "Acute on chronic systolic (congestive) heart failure"
[2] "Hyperlipidemia, unspecified"                         
[3] "Long term (current) use of insulin"                  
# Draw the ADT history for patient 10013310
plot <- ggplot() +
  geom_segment(data = transfers_10013310, aes(x = intime, xend = outtime, 
    y = 3, yend = 3, color = careunit, size = unit), linetype = "dashed") +
  geom_point(data = labevents_10013310, aes(x = charttime, y = 2), shape = 3, size = 5) +
  geom_point(data = procedures_10013310, aes(x = chartdate, y = 1, shape = Procedure), size = 5) +
  labs(title = "Patient 10013310, F, 70 years old, BLACK/AFRICAN", 
       subtitle = paste(top3_diagnoses, collapse = "\n")) +
  xlab("Calendar Time") +
  theme(axis.title.y = element_blank(),
        panel.background = element_rect(fill = "white"),
        panel.grid = element_line(color = "gray", linewidth = 0.5),
        panel.border = element_rect(color = "black", fill = NA, linewidth = 0.8)) +  
  theme(axis.text = element_text(size = 12), 
        axis.title = element_text(size = 15), 
        legend.text = element_text(size = 8),
        legend.title = element_text(size = 10)) +
  scale_y_continuous(breaks = 1:3, labels = c("Procedure", "Lab", "ADT")) +
  coord_cartesian(ylim = c(1, 3)) +
  guides(size = FALSE)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ggsave("10013310_adt.png", plot, width = 12, height = 5)
Warning: The shape palette can deal with a maximum of 6 discrete values because
more than 6 becomes difficult to discriminate; you have 9. Consider
specifying shapes manually if you must have them.
Warning: Removed 3 rows containing missing values (`geom_point()`).

The visualization for the patient with subject_id 10013310 is shown below.

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer: Here I generate the visualization for the patient with subject_id 10013310.

# Retrieve a subset of `chartevents.csv.gz` and `icustays` only containing patient 10013310
#| eval: false
zcat < ~/mimic/icu/chartevents.csv.gz | awk -F, 'NR == 1 || 
  ($1 == 10013310)' | gzip > chartevents_10013310.csv.gz
zcat < ~/mimic/icu/icustays.csv.gz | awk -F, 'NR == 1 || 
  ($1 == 10013310)' | gzip > icustays_10013310.csv.gz
# import icu data
chartevents_10013310 <- read_csv("./chartevents_10013310.csv.gz")
Rows: 7338 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): value, valueuom
dbl  (7): subject_id, hadm_id, stay_id, caregiver_id, itemid, valuenum, warning
dttm (2): charttime, storetime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_items <- read_csv("~/mimic/icu/d_items.csv.gz")
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# filter d_items and for the items of interest: abbreviation = HR, NBPd, NBPs, Temp, RR
d_item <- d_items %>% 
  filter(abbreviation %in% c("HR", "NBPd", "NBPs", "Temperature F", "RR")) %>% 
  select(itemid, abbreviation)

d_item
# A tibble: 5 × 2
  itemid abbreviation 
   <dbl> <chr>        
1 220045 HR           
2 220179 NBPs         
3 220180 NBPd         
4 220210 RR           
5 223761 Temperature F
# filter chartevents_10013310 for the items of interest
events_10013310 <- chartevents_10013310 %>% 
  filter(itemid %in% d_item$itemid) %>% 
  select(subject_id, stay_id, charttime, itemid, valuenum)

events_10013310
# A tibble: 549 × 5
   subject_id  stay_id charttime           itemid valuenum
        <dbl>    <dbl> <dttm>               <dbl>    <dbl>
 1   10013310 32769810 2153-06-11 08:00:00 223761     98.8
 2   10013310 32769810 2153-06-11 09:00:00 220045    113  
 3   10013310 32769810 2153-06-11 09:00:00 220210     26  
 4   10013310 32769810 2153-06-11 09:02:00 220179    131  
 5   10013310 32769810 2153-06-11 09:02:00 220180     62  
 6   10013310 32769810 2153-06-12 07:00:00 220045    121  
 7   10013310 32769810 2153-06-12 07:00:00 220210     25  
 8   10013310 32769810 2153-06-12 07:03:00 220179    134  
 9   10013310 32769810 2153-06-12 07:03:00 220180     70  
10   10013310 32769810 2153-06-12 08:00:00 223761     99  
# ℹ 539 more rows
# merge chartevents_10013310 and d_items
chartevents_d_10013310 <- events_10013310 %>%
  left_join(d_item, by = "itemid")

chartevents_d_10013310
# A tibble: 549 × 6
   subject_id  stay_id charttime           itemid valuenum abbreviation 
        <dbl>    <dbl> <dttm>               <dbl>    <dbl> <chr>        
 1   10013310 32769810 2153-06-11 08:00:00 223761     98.8 Temperature F
 2   10013310 32769810 2153-06-11 09:00:00 220045    113   HR           
 3   10013310 32769810 2153-06-11 09:00:00 220210     26   RR           
 4   10013310 32769810 2153-06-11 09:02:00 220179    131   NBPs         
 5   10013310 32769810 2153-06-11 09:02:00 220180     62   NBPd         
 6   10013310 32769810 2153-06-12 07:00:00 220045    121   HR           
 7   10013310 32769810 2153-06-12 07:00:00 220210     25   RR           
 8   10013310 32769810 2153-06-12 07:03:00 220179    134   NBPs         
 9   10013310 32769810 2153-06-12 07:03:00 220180     70   NBPd         
10   10013310 32769810 2153-06-12 08:00:00 223761     99   Temperature F
# ℹ 539 more rows
# draw the ICU vitals for patient 10013310 and put different vitals in different facets
plot <- ggplot(chartevents_d_10013310, aes(x = charttime, y = valuenum, color = abbreviation)) +
  geom_line() +
  geom_point() +
  facet_grid(abbreviation ~ stay_id, scales = "free", switch = "y") +
  labs(title = "Patient 10013310 ICU stays - Vitals") +
  theme(axis.title.y = element_blank(),
        panel.background = element_rect(fill = "white"),
        panel.grid = element_line(color = "gray", linewidth = 0.5),
        panel.border = element_rect(color = "grey", fill = NA, linewidth = 0.8),
        strip.text = element_text(size = 10)) +  # Adjust the size of facet titles
  scale_color_brewer(palette = "Set1") +
  theme(axis.title.x = element_blank()) + 
  guides(color = FALSE)

ggsave("10013310_icu.png", plot, width = 8, height = 5)

The visualization for the patient with subject_id 10013310 is shown below.

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

# import icustays data
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") |> 
  as_tibble() |>
  print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer: There are 50920 unique subject_id and 12488 subject_id have multiple ICU stays. The number of ICU stays per subject_id is summarized by the following graph.

# number of unique subject_id
icustays_tble %>% 
  distinct(subject_id) %>% 
  nrow()
[1] 50920
# whether a subject_id can have multiple ICU stays
icustays_tble %>% 
  count(subject_id) %>% 
  filter(n > 1) %>% 
  nrow()
[1] 12448
# number of ICU stays per subject_id
icustays_tble %>% 
  count(subject_id) %>% 
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(title = "Number of ICU stays per subject_id", 
       x = "Number of ICU stays", y = "Frequency") +
  theme_minimal() + 
  scale_x_continuous(limits = c(1, 10)) +
  scale_y_continuous(limits = c(0, 10000))
Warning: Removed 101 rows containing non-finite values (`stat_count()`).
Warning: Removed 2 rows containing missing values (`geom_bar()`).

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

# import admissions data
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz") |> 
  as_tibble() |>
  print(width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

# number of admissions per patient
admissions_tble %>% 
  count(subject_id) %>% 
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(title = "Number of admissions per patient", 
       x = "Number of admissions", y = "Frequency") +
  theme_minimal()

# admission hour
admissions_tble %>%
  mutate(admission_hour = hour(admittime)) %>%
  count(admission_hour) %>%
  ggplot(aes(x = admission_hour, y = n)) +
  geom_col() +
  labs(title = "Distribution of Admission Hour", 
       x = "Admission Hour", 
       y = "Frequency") +
  theme_minimal()

# admission minute
admissions_tble %>%
  mutate(admission_minute = minute(admittime)) %>%
  count(admission_minute) %>%
  ggplot(aes(x = admission_minute, y = n)) +
  geom_col() +
  labs(title = "Distribution of Admission minute", 
       x = "Admission minute", 
       y = "Frequency") +
  theme_minimal()

# length of hospital stay
admissions_tble %>% 
  mutate(length_of_stay = as.numeric(difftime(dischtime, admittime, units = "hours"))) %>% 
  count(length_of_stay) %>%
  ggplot(aes(x = length_of_stay)) +
  geom_histogram() +
  labs(title = "Distribution of length of hospital stay", 
       x = "Length of hospital stay (hour)", y = "Frequency") +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

# import patients data
patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz") |> 
  as_tibble() |>
  print(width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

# Summary of gender
gender_summary <- patients_tble %>%
  count(gender) %>%
  mutate(percent = n / sum(n) * 100)

# Plot gender distribution
ggplot(gender_summary, aes(x = gender, y = percent, fill = gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Gender Distribution of Patients", x = "Gender", y = "Percentage") +
  theme_minimal()

# Summary of anchor_age
anchor_age_summary <- patients_tble %>%
  group_by(anchor_age) %>%
  summarise(count = n())

# Plot anchor_age distribution
ggplot(anchor_age_summary, aes(x = anchor_age, y = count)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Anchor Age", x = "Anchor Age", y = "Count") +
  theme_minimal()

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

# import labevents and d_labitems data
labevents_tble <- labevents |> 
  dplyr::select(labevent_id, subject_id, itemid, storetime, valuenum) |> 
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz")
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# restrict to the last available measurement (by `storetime`) before the ICU stay.
labevents_tble <- labevents_tble %>%
  group_by(subject_id, itemid) %>%
  filter(storetime == max(storetime)) %>%
  ungroup()
# merge labevents and d_labitems
labevents_item <- labevents_tble %>%
  left_join(d_labitems, by = "itemid")
labevents_item %>% 
  distinct(subject_id) %>% 
  nrow()
[1] 245723
# The final `labevents_tble` should have one row per ICU stay and columns for each lab measurement.
labevents_tble <- labevents_item %>%
  pivot_wider(names_from = label, values_from = valuenum)
# merge the lines with the same subject_id
labevents_tble1 <- labevents_tble %>%
  group_by(subject_id) %>%
  summarise(across(everything(), ~ first(.)))

labevents_tble1
# A tibble: 245,723 × 14
   subject_id labevent_id itemid storetime           fluid category  Bicarbonate
        <int>       <int>  <dbl> <dttm>              <chr> <chr>           <dbl>
 1   10000032         604  50882 2180-08-10 05:53:00 Blood Chemistry          23
 2   10000048         647  50882 2126-11-22 13:32:00 Blood Chemistry          26
 3   10000084         866  50882 2160-12-27 12:13:00 Blood Chemistry          24
 4   10000108         950  51221 2163-09-27 12:12:00 Blood Hematolo…          NA
 5   10000117        1754  50931 2184-09-02 10:28:00 Blood Chemistry          NA
 6   10000248        1810  50882 2192-11-29 12:09:00 Blood Chemistry          28
 7   10000280        1872  51221 2151-03-17 20:42:00 Blood Hematolo…          NA
 8   10000285        1945  50882 2161-11-08 08:52:00 Blood Chemistry          24
 9   10000473        2022  50882 2138-03-15 16:56:00 Blood Chemistry          25
10   10000492        2066  50882 2129-08-04 07:21:00 Blood Chemistry          19
# ℹ 245,713 more rows
# ℹ 7 more variables: Chloride <dbl>, Creatinine <dbl>, Glucose <dbl>,
#   Potassium <dbl>, Sodium <dbl>, Hematocrit <dbl>, `White Blood Cells` <dbl>

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay

  • Length of ICU stay los vs first ICU unit